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ABSTRACT 


Vectorizable  sparse  equation  solution  algorithms  are  classified  by 
the  matrix  structure  which  they  favor.  The  state-of-the-art  for  solution 
of  relatively  dense  systems  is  then  reviewed.  A hybrid  vector  construct 
is  defined  for  the  increasingly  common  structure  of  both  moderate  local 
matrix  density  and  global  matrix  regularity.  Estimates  are  made  of  CRAY- 
1 speedup  achievable  with  this  construct.  A finite  difference  matrix  is 
studied  as  an  example. 


INTRODUCTION 


Direct  solution  of  sparse 
systems  has  enjoyed  wide  applica- 
tion to  simulation  of  lumped  phy- 
sical systems  described  by  ordin- 
ary differential  equations.  Also, 
the  last  decade  has  seen  a move- 
ment toward  implicit  solution  of 
partial  differential  equations 
away  from  explicit  procedures.  An 
excellent  example  is  Navier  Stokes 
aerodynamic  simulation  codes, 
which  have  changed  from  the  purely 
explicit,  through  hybrid  explicit- 
implicit^  and  now  purely  implicit 


and  (3)  presents  new  results  in 
the  detection  of  vectors  in  pat- 
terned sparse  systems.  All  of  the 
experimental  results  were  obtained 
from  the  CRAY-1;  even  the  algorithm 
classifications  to  ba  made  are 
useful  only  for  a memory-hierarchi- 
cal processor  of  the  CRAY-1  class 
with  a range  of  scalar,  short 
vector,  and  long  vector  capabili- 
ties. 


CLASSIFICATION 


procedures^ 


The  vectorization  of  direct 
solution  portions  of  large  codes 
has  an  immediate  aspect  related  to 
the  recoding  of  specific  equation 
solvers  for  a particular  architec- 
ture. Although  most  vector  archi- 
tectures have  at  least  a minimal 
provision  for  sparse  vector  opera- 
tions, an  overhead  is  inevitably 
incurred  in  reduced  memory  band- 
width and/or  the  loading  of  assoc- 
iated bit  maps  and  linked  lists. 

It  is  the  goal  of  research  in 
sparse  matrix  algorithms  to  reduce 
this  overhead  by  re-organization 
of  the  computation  either  (1)  to 
obtain  longer  vectors,  or  (2)  to 
reduce  data  flow,  and  thus  achieve 
an  overall  speedup. 


Consider  the  linear  system 
Ax  * b solved  by  triangular  fact- 
orization of  A into  L and  U.  As- 
sume that  the  factorization  has 
proceeded  by  outer  product  column- 
row  operations  so  that  an  nxn  un- 
reduced system  remains.  The  struc- 
ture of  this  unreduced  system — 
which  includes ' fill  from  the  com- 
pleted portion  of  the  reduction — 
then  becomes  the  principal  issue 
in  determination  of  the  sparsity 
algorithm  to  be  used  during  the 
remainder  of  the  reduction.  This 
is  an  important  generalization 
beyond  examination  of  only  the 
structure  of  A,  since  it  suggests 
the  use  of  different  algorithms 
(polyalgorithms)  as  the  reduction 
proceeds  and  fill  increases  the 
density  of  the  unreduced  portion. 


This  paper  (1)  classifies 
sparse  matrix  characteristics 
amenable  to  vector  processing,  (2) 
reviews  the  state-of-the-art  in 
solving  certain  of  these  problems, 


Four  sparsity  structures  will 
be  considered  at  various  parts  of 
this  paper;  they  are  listed  below 
to  assist  in  unifying  the  later 
discussion.  These  distinguishing 
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attributes  are  related  to  local 
and  global  sparsity  characteris- 
tics: 

(a)  locally  and  globally 
dense,  partitioned; 

(b)  locally  dense,  globally 
unpatterned; 

(c)  locally  dense,  globally 
patterned ; 

(d)  locally  sparse,  globally 
patterned; 

(e)  locally  sparse,  globally 
unpatterned. 

The  last  is  the  least  vectorizable. 
Its  scalar  solution  is  probably 
amenable  to  speedup  only  by  using 
a MIMD  architecture3  and  so  will 
not  be  discussed  further. 

BLOCK-ORIENTED  SPARSE  SOLUTION 
INTRODUCTION 

Two  classes  of  relatively 
dense  matrices  benefit  from  solu- 
tion by  a general  sparse  solver 
which  is  oriented  toward  the  solu- 
tion of  block  structures.  Al- 
though algorithmically  less 
challenging  than  the  sparser  case 
to  be  studied  later,  such  struc- 
tures are  becoming  more  common  due 
to  the  aforementioned  increase  in  the 
implicitness  of  PDE  solution  codes. 

THE  DENSE,  PARTITIONED  CASE 

The  utility  of  a general 
sparse  solver  in  the  analysis  of 
full,  banded,  and  other  dense 
systems  arises  from  vector  length 
limitations  of  the  processor, 
which  in  turn  results  from  a rela- 
tively small  cache  memory  in  a 
hierarchial  memory  system.  Such 
dense  systems  must  be  block- 
partitioned;  in  the  case  of  the 
CRAY-1,  these  partitions  must  be 
limited  in  one  dimension  to  €4, 
the  maximum  vector  length  of  the 
machine.  Using  a general  solver 
avoids  the  writing  of  specialized 
assembly  language  routines  for 
dense  systems  with  globally  dif- 
ferent density  patterns  but  which 
are  partitioned  into  ,'ocally  sim- 
ilar 64-length  or  smaller  dense 
blocks. 

The  processing  of  such  large 
blocks  with  a sparse  solver  can  be 
carried  out  on  the  CRAY-1  with 
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>99.9%  of  the-soluti^n  time  ii 
numeric  kernels,  and  with  <.1% 
processing  of  lists  resulting  from 
the  general  sparsity  assumption. 

A variety  of  common  compressed 
storage  schemes  can  also  be  accom- 
odated4. 

LOCAL  DENSITY 


Moderate-sized  dense  blocks 
occur  naturally  from  the  represen- 
tation of  variable  and  equation 
coupling,  from  nodal  coupling  in  a 
grid,  and  from  coordinate  trans- 
formations, among  other  causes. 

In  the  absence  of  other  vectoriza- 
tion  strategies  (to  be  discussed 
shortly) , it  becomes  necessary  to 
reduce  the  system  a block  at  a 
time  with  dense  matrix  kernels  of 
a block-oriented  sparse  solver. 
Descriptors  of  the  location  and 
size  of  the  block  suffice  to  quide 
the  solution  of  such  a system4. 

The  overhead  of  list  process- 
ing of  the  blocks  may  be  compen- 
sated by  finely-tuned  numeric 
•kernels,  with  the  net  result  that 
a general  solver  can  execute  at  a 
higher  rate  than  a conventionally- 
coded  specialized  solver4. 


The  execution  rate  is  of 
course  highly  dependent  on  the 
matrix  sparsity  structure.  How- 
ever, a timing  model  of  the  num- 
eric kernels  and  the  list  process- 
ing overhead4  allows  the  esta- 
blishment of  MFLOPS  bounds  for 
matrices  of  constant  block  sizes 
but  arbitrary  block  sparsity 
patterns.  Such  bounds  are  given 
in  Table  1 for  the  CRAY-1.  The 
minimum  rate  is  achieved  with  a 
single  off-diagonal  block  (e.g., 
block  tridiagonal)  and  the  maximum 
with  r off-diagonal  blocks  (Figure 
1) , as  r~«. 


LOCALLY  DENSE,  GLOBALLY  PATTERNED 
SPARSE  SYSTEMS 

INTRODUCTION 

Table  1 shows  that  processing 
block  sizes  with  dimensions  below 
10  utilizes  a small  fraction  of 
the  CRAY-1  processor  speed.  To 
regain  a high  processing  rate, 
another  structural  property  be- 
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Block 

sizes 


MFLOPS 

range 


2 

1.9 

- 

7.6 

3 

5.0 

- 

17. 

4 

10. 

- 

26. 

6 

21. 

- 

43. 

8 

32. 

- 

60. 

12 

54. 

- 

84. 

16 

69. 

- 

98. 

32 

102. 

- 

124 

64 

126. 

- 

141 

Table  1.  Performance  of  general 
block  sparse  system 
solver  on  the  CRAY-1 


►r  blocks 


Diagonal 
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blocks 


Figure  1. 


Model  of  block  pivot 
step 


sides  density  should  be  exploited. 
It  is  proposed  to  utilize  global 
similarities  or  patterns  to  length- 
en density-related  vectors.  These 
will  be  termed  hybrid  vectors  and 
are  the  subject  of  the  remainder  of 
this  paper. 

A "bottom-up"  approach  will  be 
used.  After  defining  and  illustra- 
ting the  model  hybrid  problem,  it 
will  first  be  demonstrated  that  the 
CRAY-1  can  achieve  considerably 
higher  execution  rates  on  hybrid- 
related  kernels.  Then  it  will  be 
shown  how  such  hybrid  vectors  can 
be  achieved  with  common  finite 
difference  (or  finite  element) 
structures. 

GLOBAL  vs.  LOCAL  PROPERTIES 

In  establishing  the  hybrid 
vector  concept,  it  will  be  useful 


to  use  the  notion  of  the  graph  of 
a matrix. 

The  non-zero  structure  of  a 
matrix  A,  where  A is  structurally 
symmetric,  has  a convenient  graph 
theoretic  formulation.  Assume 
that  i*l,2,...n.  Let  V ■ 

{vi»V2» • • .vn) , with  the  v^  termed 
vertices  and  V the  vertex  set. 
Define  a set  P of  ordered  pairs  of 
V,  called  edges,  by  (Vi,Vj)€P  if 
and  only  if  a^jO  and  kp  j . Then 
G=G (V, P)  is  called  the  graph  of  A. 
Mote  that,  because  the  matrix  is 
structurally  symmetric,  (Vi*Vj)6P 
if  and  only  if  (Vj,v^)6E  P. 

To  illustrate  the  relation- 
ship between  local  and  global  pro- 
perties, consider  the  subgraphs 
Gi  and  G2  of  Figure  2(a).  These 
subgraphs  are  possibly  connected 
by  paths  through  vertices  not 
shown,  but  are  assumed  to  be  not 
directly  connected . If  the  assoc- 
iated equations  are  arranged  in 
the  numbered  order,  the  partial 
matrix  structure  of  Figure  2(b)  re- 
sults. This  structure  is  locally 
dense  (contrast  full)  but  globally 
sparse,  since  the  two  dense  sub- 
matrices are  not  coupled  in  the 
northwest  matrix  partition.  If 
the  equations  are  reordered  so 
that  similarly-connected  nodes  are 
consecutively  ordered,  then  each 
of  the  resulting  16  partitions  is 
either  a diagonal  or  a null  sub- 
matrix (Figure  2(c)).  Because 
most  sparse  blocks  are  coupled  to 
other  sparse  blocks  by  diagonal 
coupling  blocks,  the  matrix  struc- 
ture is  now  termed  globally  dense. 
(It  may  be  noted  that  the  local 
density  pattern  of  each  dense 
block  of  Figure  2(b)  is  identical 
to  the  global  density  pattern  of 
Figure  2(c).)  The  factorization 
of  the  northwest  corner  of  the 
system  matrix  may  utilize  any  algo- 
rithm, independently  of  the  algo- 
rithms used  to  reduce  the  remain- 
der of  the  matrix. 

If  the  connection  symmetry 
between  the  two  sets  of  nodes 
undergoing  reduction  extends  to 
their  interconnections  to  other 
unreduced  nodes  as  in  Figure  3(a), 
and  if  these  unreduced  nodes  are 
properly  ordered  as  shown,  then 
tne  northeast  and  southwest  parti- 
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(a)  Similar  subgraphs  and  connect- 
ions (o-node  undergoing  reduc- 
tion; e-unreduced  node) . 


(b)  Locally  dense,  globally  sparse 
corner 


(c)  Locally  sparse,  globally 
dense  corner. 

Figure  2.  Relationships  between 
local,  global  matrix 
properties 


(b)  Associated  matrix 

Figure  3.  Similar  subgraphs  with 
similar  connections  to 
rest  of  graph. 


tions  can  be  made  to  contain  simi- 
lar diagonal  coupling  matrices 
(Figure  3(b). 

KERNEL  STUDY 

In  solution  of  large  sparse 
systems,  the  multiplication/accum- 
ulation (M/A)  kernel  dominates 
other  numeric  kernels.  Elimina- 
tion of  a strip  of  row  and  column 
blocks  symmetrically  coupling  a 
diagonal  block  to  r other  diagonal 
blocks  (Figure  1)  requires  (a)  fac- 
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A 


B 


torization  of  a diagonal  block, 

(b)  r block  forward  and  back  sub- 
stitutions, and  (c)  r2  multiplica- 
tions/accumulations. For  r=3  (a 
common  number  for  dissected  finite 
element  and  finite  difference 
grids) , 69%  of  the  operations  are 
of  the  M/A  type.  The  M/A  kernel 
therefore  warrants  principle  study. 


The  nature  of  the  M/A  model 
kernel  with  both  local  diagonal 
sparsity  and  global  density  is 
illustrated  in  Figure  4.  It  is 
proposed  to  study  the  execution  of 
the  kernel 

C - C t A*B  (1) 

where  A,  B,  and  C are  illustrated 
in  the  figure. 


B *r  blocks 


blocks  * unreduced 


matrix  partition 

Figure  4.  Example  of  model 
problem 

The  preference  for  processing 
hybrid  kernels  can  be  expected  to 
arise  from  the  interconnection  or, 
more  generally,  the  data  flow  pro- 
tocol of  the  processor.  For  the 
CRAY-1,  two  recursive  features  of 
the  vector  registers  permit  high 
performance  M/A  kernels. 

4-Matrix  M/A.  The  pattern  illus- 
trated by  the  matrix  multiply 


represents,  on  equation  reordering, 
the  simultaneous  multiplication  of 
four  3x3  full  matrices.  It  is 
proposed  to  implement  the  assoc- 
iated accumulation  kernel  by  form- 
ing 


i.a.,  by  accumulating  a column  of 
diagonal  blocks  of  C.  To  perform 
each  term  of  the  summation  by  a 
single  chained  multiply-add  vector 
operation  with  the  CRAY-1  requires 
chain  replications  of  the  4-length 
Bk:  to  a 12-length  vector  so  that 
the  overhead  of  the  replication 
does  not  seriously  impact  the  over- 
all timing. 

The  basis  of  this  replication 
is  the  recursive  feature  of  the 
vector  logical  pipeline,  whereby, 
if  the  same  vector  register  is  both 
operand  and  result  register-- 
usually  prohibited  in  register 
allocation — data  will  be  delayed 
four  clocks  in  the  pipeline  and 
the  desired  replication  achieved. 
Figure  5 gives  the  CAL  instruction 
sequence  and  the  clock  level  report 
of  a part  of  the  accumulation  loop, 
as  reproduced  by  a CRAY-1  timing 
simulator5. 

Table  2 gives  the  execution 
rates  of  a complete  4-matrix  mul- 
tiply, in  comparison  with  the  rates 
of  two  full  matrix  multiply  kernels 
previously  studied.  The  standard 
full  kernel  for  short  vectors  pro- 
duces large  gaps  in  the  floating 
point  pipelines  due  to  the  chain- 
ing®*7. The  high-performance 
matrix  multiply  kernel  avoids 
chaining  and  the  consequent  gaps 


but  suffers  from  register  and  pipe- 
line reservations  and  addressing 
overhead  resulting  from  four  sep- 
arate invocations  of  the  full  mat- 
trix  multiply.  Table  2 shows  that 
nearly  three  times  the  execution 
rate  is  achieved  for  multiplication 
of  four  4x4  matrices  with  the  spec- 
ialized kernel,  in  comparison  with 
the  standard  CAL  kernel. 

8 Matrix  M/A.  A similar  recursive 
feature  of  the  addition  pipeline 
allows  the  rapid  accumulation  of 


8-length  vectors  and  consequently 
the  simultaneous  multiplication  of 
8 matrices.  This  is  a well  known 
feature  described  in  [8]  and  will 
not  be  discussed  here.  It  suffices 
to  note  in  Table  2 the  extraordin- 
ary speedups  achievable  with  very 
small  matrices.  However,  the 
execution  rate  has  a large  dis- 
continuity between  n«7  and  n*8, 
due  to  the  nature  of  the  algorithm, 
and  is  less  desirable  beyond  n=7 
than  a 4-matrix  multiply. 
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output  for  4-matrix  accumulation  instruction  se- 

quence  for  VL  - 64.  VI  is  replicated  into  V3  with  VM  - 0077.. 7g. 
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Full 


H.igh- 


8- 


matrix  Stand,  perf.  matrix  matrix 


size 

full 

full 

hybrid 

hybrid 

2 

7.6 

8.7 

17.2 

22.1 

4 

19. 

30. 

54.0 

55.7 

6 

NT* 

NT 

76.3 

88.5 

8 

43. 

64. 

90.2 

72.5 

10 

NT 

NT 

97.8 

90.8 

16 

88. 

102. 

119. 

— 

* Not  tested 

**  Positive  accumulation  only 

Table  2.  Execution  rates  (MFLOPS) 
of  matrix  multiply  ker- 
nels. Subroutine  entry 
and  exit  overhead  is  not 
included. 


AN  ALGORITHMIC  OVERVIEW 


An  important  algorithmic  pro- 
perty of  the  above  hybrid  vector 
construct  is  that  the  kernel  vector 
length  is  proportional  to  the  pro- 
duct of  factors  related  to  (1)  the 
local  matrix  density  and  (2)  the 
global  matrix  patterns.  In  less 
precise  terms,  one  may  claim  the 
length  is  the  product  of  local 
coupling  and  global  decoupling. 


In  solution  of  large  init- 
ially sparse  patterned  systems, as 
the  reduction  progresses  fill 
causes  the  coupling  to  increase 
and  the  decoupling  to  decrease, 
leaving  the  possibility  that  their 
product  remains  a relative  constant. 


Such  a result  could  produce  a 
very  useful  generalization  of 
previous  work  (ref.  (9] [10])  where 
vector  lengths  were  assumed  a func- 
tion of  the  local  density  only  or 
global  patterns  only11.  To  length- 
en vectors  by  increasing  local 
density,  previous  algorithms  were 
inevitably  driven  to  an  increase 
in  the  arithmetic  computational 
complexity1^. 


The  following  study  can  be 
considered  an  initial  investiga- 
tion into  the  production  of  hybrid 
vectors  for  finite  difference  grids. 


A number  of  algorithmic  questions 
will  be  left  unanswered,  a topic 
for  continuing  research. 


GENERATION  OF  HYBRID  VECTORS 
INTRODUCTION 


Given  the  graph  of  a matrix, 
it  is  proposed  to  perform  opera- 
tions on  this  graph  which  yield 
hybrid  vectors  in  the  matrix  re- 
duction with  either  no  increase  or 
a determinable  increase  in  the 
arithmetic  operation  count.  The 
example  of  a 5-point  2-D  finite 
difference  grid  will  be  used  to 
illustrate  the  procedure,  because 
of  its  connection  regularity  and 
because  its  solution  by  nested  dis- 
section is  characterized  by  exploi- 
tation of  decoupling  to  achieve  a 
reduced  arithmetic  operation  count 
for  large  grids.  The  reader  is 
assumed  to  be  familiar  with  this 
dissection  process1^ > 11 . 


FOLDING  AND  ROTATION 


The  (diagonal)  nested  dissec- 
tion of  a 5x5  grid  proceeds  by  re- 
cursively dividing  the  grid  into 
quadrants  until  each  quadrant  con- 
sists of  a single  node.  This  div- 
ision is  performed  along  diagonal 
separators,  which  are  lines  of 
nodes  whose  removal  divides  the 
graph  into  unconnected  parts. 


It  is  clear  from  Figure  6(a) 
that,  since  the  quadrants  have  a 
similar  structure,  "similarly-pos- 
itioned" vertices  not  on  a separator 
may  be  eliminated  simultaneously 
with  vector  operations,  without 
increasing  arithmetic  computation. 
These  vectors  will  be  of  length 
four,  as  required  for  the  4-matrix 
kernel  of  Figure  5. 


"Similarly-positioned"  nodes 
can  be  generated  by  overlaying  the 
quadrants  so  that  a single  node  in 
the  overlay  represents  4 nodes. 

This  single-quadrant  representation 
of  the  4 quadrants  may  be  achieved 
by  folding  or  rotating  the  original 
graph.  This  rotation  process  is 
illustrated  in  Figure  6 (a) -(b);  a 
recursive  folding  process  - which 
generates  vectors  of  decreasing 
length  - is  discussed  in  ref.  [14). 


Because  the  non-separator 
nodes  are  eliminated  first  in  the 
nested  dissection  process,  these 
interior  nodes  are  represented  by 
the  northwest  corner  of  the  system 
matrix;  this  corner  is  consequent- 
ly guaranteed  to  consist  of  4x4 
blocks  with  either  diagonal  or  null 
structure.  The  southeast  corner 
of  LU,  representing  the  reduction 
of  the  separator  nodes,  is  dense 
and  can  be  reduced  at  execution 
rates  exceeding  100  MFLOPS.  The 
northeast  and  southwest  partitions, 
however,  represent  coupling  between 
the  separators  and  the  interior 
nodes  of  the  quadrants.  Asymptoti- 
cally in  the  grid  dimensions, 
operations  involving  these  two 
partitions  consist  of  approximate- 
ly 30%  of  the  total,  so  that  the 
choice  of  a proper  kernel  is  im- 
portant (Figure  7).  Irregularity 
in  these  coupling  matrices  results 
in  part  from  the  separator  nodes 
shared  by  Q1  and  Q4  (nodes  #1,  #7 
and  #13  in  Figure  6(a))  in  the 
rotation  sequence.  The  regularity 
may  be  restored  by  cutting  the 
graph  along  the  boundary,  adding 
nodes  and  associated  unknowns,  and 
adding  equations  that  relate  the 
new  nodes  to  the  originally  shared 
ones.  This  cutting  process  is  il- 
lustrated in  Figure  6(c) -(d). 

The  structure  of  L and  0 re- 
sulting from  application  of  such 
cutting  to  a 17  x 17  finite  dif- 
ference grid  is  shown  in  Figure 
8 (b) . The  conventional  nested 
dissection  ordering  for  the  same 
matrix  yields  the  LU  map  of  Figure 
8(a).  Coupling  in  the  northwest 
partition  appears  as  4x4  diagonal 
blocks,  as  predicted;  coupling  in 
the  northeast  and  southwest  par- 
titions appears  as  4-length  stripes, 
but  not  necessarily  as  4x4  diagonal 
blocks.  Thus  a somewhat  modified 
4-jnatrix  accumulation  kernel  would 
have  to  be  used.  The  addition  of 
the  8 nodes  along  the  cut  also  in- 
creases each  dimension  of  the  dense 
southeast  corner  of  LU  by  approxi- 
mately 25%.  The  total  increase  in 
computation  resulting  from  cutting 
is  as  yet  undetermined. 

SUMMARY 

Vectorization  and  data  flow 


Figure  6.  Rotation  of  quadrants  (a) 
into  single-quadrant 
representation  (b) ; ro- 
tation with  cut  and 
creation  of  nodes  ((bi- 
te)). 


for  a memory  hierarchical  process- 
or add  two  new  issues  to  be  con- 
sidered in  the  development  of  codss 
for  the  direct  solution  of  2-D 
finite  difference  grids.  What  is 
a single  algorithm  class — nested 
dissection — for  a scalar  machine 
now  divides  into  subclasses  of 
algorithms,  of  which  the  above 
proposal  is  only  one.  Checker- 
board and  related  ordering  strat- 
egies are  also  attractive;  pre- 
liminary estimates  indicate  such 
codes  will  execute  over  100  MFLOPS 
on  the  CRAY-115,  which  can  partial- 
ly compensate  for  increased  arith- 
metic computation. 


accum:  4-matrix 

% opns:  50 

MFLOPS:  Table  2 

accum:  ? 

accum:  dense 

% opns:  30 

* opns:  20 

MFLOPS:  ? 

MFLOPS:  >100 

(Table  1) 

♦Approx,  percent  of  total  arith- 
metic operations. 


Figure  7. 


Estimated  asymptotic 
perf.  of  polyalgorithm 
to  perform  nested  dis- 
section in  four  matrix 
partitions. 


SPARSE,  PATTERNED  SYSTEMS 

As  the  coupling  in  A,  B,  and 
C of  Figure  4 decreases,  each 
approaches  a diagonal  matrix.  The 
accumulation  then  involves  at  least 
two  vector  loads  (and  usually  one 
vector  store)  for  each  floating 
point  M/A  operation  and  sufficient 
list  processing  to  locate  at  least 
two  of  the  matrices  in  memory.  An 
accumulation  kernel  written  for 
the  CRAY-1,  includin?  list  process- 
ing and  a vector  store  for  each 
accumulation,  executes  at  the  rate 


MFLOPS 


53.3  (, 


with  the  maximum  value  of  35.8  for 
l = 64.  This  is  less  than  1/4  the 
asymptotic  rate  of  a dense  accumu- 
lation. The  kernel  is  memory  bound 
and  involves  significant  start  up 
time  for  the  relatively  small  float- 
ing point  computation  involved. 

CONCLUSION 

While  the  speed  of  vector 
processors  encourages  the  formula- 
tion of  denser  systems,  their  in- 
creasingly parallel  design  favors 
the  construction  of  longer  vectors 
that  can  be  distributed  across 
many  pipelines  operating  concur- 
rently. In  this  paper,  the  vector- 
lengthening advantages  of  the  hy- 
brid vector  construct  have  been 
shown  at  the  kernel  level  and 
methods  have  been  proposed  to  pro- 
duce such  vectors  directly  from 
the  problem  structure. 

From  the  algorithm  viewpoint, 
the  direct  relationship  between 
problem  and  processor  structure 
offers  novel  insight  possibly  use- 
ful in  developing  a family  of 
equation  ordering  techniques 
based  on  folding,  rotation, etc . 

It  is  also  hoped  that  a high  per- 
formance software  package  may  be 
developed  for  specific  2D  grid 
geometries. 

From  the  viewpoint  of  process- 
or architecture,  this  paper  has 
quantified  the  motion  that  the 
less  dense  the  system,  the  more 
data  flow  and  other  accumulation 
kernel  overhead  is  required.  A 
patterned  system  may  permit  the 
lengthening  of  vectors  - which  re- 
duces the  influence  of  overhead  - 
but  does  not  significantly  alter 
the  data  flow  problem. 
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